by the Oracle AutoMLx Team
AutoMLx Text Classification Demo version 23.1.1.
Copyright (c) 2023 Oracle, Inc.
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/
In this notebook we build a classifier using the Oracle AutoMLx tool for the public 20 Newsgroups dataset. The dataset is a multi-class text classification dataset; more details about it can be found at http://qwone.com/~jason/20Newsgroups/. We explore the various options provided by the Oracle AutoMLx tool that allow the user to exercise control over the AutoML training process. We then evaluate the different models trained by AutoML. Finally, we provide an overview of the possibilities that Oracle AutoMLx offers for explaining the predictions of the tuned model.
Data analytics and modeling problems using Machine Learning (ML) are becoming popular and often rely on data science expertise to build accurate ML models. Such modeling tasks primarily involve the following steps:
- Preprocessing the dataset (for example, cleaning and transforming raw features),
- Selecting an informative subset of features,
- Choosing an appropriate model (algorithm) for the task, and
- Tuning the hyperparameters of the chosen model.
All of these steps are significantly time consuming and rely heavily on data scientist expertise. To make matters worse, the best feature subset, model, and hyperparameter choice varies widely with the dataset and the prediction task, so there is no one-size-fits-all solution that achieves reasonably good model performance. Using a simple Python API, AutoML can quickly jump-start the data science process with an accurately tuned model and appropriate features for a given prediction task.
! pip install seaborn==0.12.1
%matplotlib inline
%load_ext autoreload
%autoreload 2
Load the required modules.
import gzip
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups
# Settings for plots
plt.rcParams['figure.figsize'] = [10, 7]
plt.rcParams['font.size'] = 15
sns.set(color_codes=True)
sns.set(font_scale=1.5)
sns.set_palette("bright")
sns.set_style("whitegrid")
import automl
from automl import init
We start by reading in the dataset from sklearn. The dataset has already been pre-split into training and test sets. The training set will be used to create a Machine Learning model using AutoML, and the test set will be used to evaluate the model's performance on unseen data.
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')
target_names = train.target_names
X_train, y_train = pd.DataFrame(train.data), pd.DataFrame(train.target)
X_test, y_test = pd.DataFrame(test.data), pd.DataFrame(test.target)
column_names = ["Message"]
X_train.columns = column_names
X_test.columns = column_names
Let's look at a few of the values in the data. 20 Newsgroups is a classification dataset made of text samples. Each sample has an associated class (also called a topic), which can be one of the following:
train.target_names
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
We display some examples of data samples.
X_train.head()
| | Message |
|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... |
We restrict the data to the 4 topics that are related to science.
science_labels = [i for i in range(len(target_names)) if 'sci' in target_names[i]]
train_science_indices = [i for i in range(len(y_train)) if y_train.iloc[i][0] in science_labels]
test_science_indices = [i for i in range(len(y_test)) if y_test.iloc[i][0] in science_labels]
X_train = X_train.iloc[train_science_indices]
y_train = y_train.iloc[train_science_indices]
X_test = X_test.iloc[test_science_indices]
y_test = y_test.iloc[test_science_indices]
len(X_train)
2373
We further downsample the train set to have a reasonable training time for this demonstration.
X_train, _, y_train, _ = train_test_split(X_train, y_train,
test_size=0.5,
stratify=y_train,
random_state=42)
Finally, we generate a validation set, which we use only for internal pipeline validation.
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,
test_size=0.2,
stratify=y_train,
random_state=42)
The AutoML pipeline offers the function init, which allows the user to initialize the parallel engine. By default, the AutoML pipeline uses the Dask parallel engine. One can also set the engine to local, which instead uses Python's multiprocessing library for parallelism.
init(engine='local')
[2023-02-10 12:29:19,625] [automl.xengine] Local ProcessPool execution (n_jobs=40)
The Oracle AutoMLx solution provides a pipeline that automatically finds a tuned model given a prediction task and a training dataset. In particular, it can find a tuned model for any supervised prediction task, e.g., classification or regression, where the target can be binary, categorical, or real-valued.
AutoML consists of five main modules:
- Preprocessing: automatically cleans and transforms the raw dataset into model-ready features.
- Algorithm Selection: identifies the most promising algorithm for the given dataset and task.
- Adaptive Sampling: identifies the smallest dataset sample that preserves model quality.
- Feature Selection: identifies the smallest feature subset that preserves model quality.
- Hyperparameter Tuning: searches for the best hyperparameter configuration of the chosen algorithm.
All these pieces are readily combined into a simple AutoML pipeline which automates the entire Machine Learning process with minimal user input/interaction.
The AutoML API is quite simple to work with. We create an instance of the pipeline. Next, the training data is passed to the fit() function, which executes the previously mentioned steps.
A model is then generated and can be used for prediction tasks. We use the F1 micro scoring metric to evaluate the performance of this model on unseen data (X_test).
est1 = automl.Pipeline(task='classification')
est1.fit(X_train, y_train, X_valid, y_valid, cv=None, col_types=['text'])
y_predict = est1.predict(X_test)
score_default = f1_score(y_test, y_predict, average="micro")
print('F1 Micro Score on test data: {:3.3f}'.format(score_default))
[2023-02-10 12:29:19,999] [automl.pipeline] Random state (7) is used for model builds
[2023-02-10 12:29:20,031] [automl.pipeline] Dataset shape: (948, 1)
[2023-02-10 12:29:20,114] [automl.pipeline] Running Auto-Preprocessing
[2023-02-10 12:29:22,557] [automl.pipeline] Preprocessing completed. Updated Dataset shape: (948, 11026), cv: None
[2023-02-10 12:29:28,698] [automl.pipeline] SVC, KNeighborsClassifier are disabled for datasets with > 10K samples or > 1K features
[2023-02-10 12:29:28,699] [automl.pipeline] Running Model Selection
[2023-02-10 12:30:25,005] [automl.pipeline] Model Selection completed. Selected model: ['TorchMLPClassifier']
[2023-02-10 12:30:25,007] [automl.pipeline] Running Adaptive Sampling. Dataset Shape: (948, 11026), Valid Shape: (238, 11026), CV: None, Class counts: [237 237 237 237]
[2023-02-10 12:31:45,191] [automl.pipeline] Adaptive Sampling Completed. Updated Dataset Shape: (948, 11026), Valid Shape: (238, 11026), CV: None, Class counts: [237 237 237 237]
[2023-02-10 12:31:45,193] [automl.pipeline] Starting Feature Selection 0. Dataset Shape: (948, 11026)
[2023-02-10 12:34:46,199] [automl.pipeline] Feature Selection 0 completed. Updated Dataset shape: (948, 3972)
[2023-02-10 12:34:46,283] [automl.pipeline] Tuning TorchMLPClassifier
[2023-02-10 12:36:11,297] [automl.pipeline] Tuning completed. Best params: {'activation': 'relu', 'class_weight': None, 'dropout': 0.1, 'l2_reg': 0.0, 'nr_layers': 3, 'nr_neurons': 100, 'optimizer': 'adagrad'}
[2023-02-10 12:36:11,956] [automl.pipeline] (Re)fitting Pipeline
[2023-02-10 12:36:36,724] [automl.xengine] Local ProcessPool execution (n_jobs=40)
[2023-02-10 12:36:36,843] [automl.pipeline] AutoML completed. Time taken - 435.559 sec
F1 Micro Score on test data: 0.913
During the AutoML process, a summary of the optimization process is logged, including the dataset characteristics, the optimization metric, the selected features, algorithm, and hyperparameters, and the results of each trial performed.
AutoML provides a print_summary API to output all the different trials performed.
est1.print_summary()
| Training Dataset size | (948, 1) |
| Validation Dataset size | (238, 1) |
| CV | None |
| Optimization Metric | neg_log_loss |
| Selected Features | Index(['001230', '004418', '00r', '01', '01wb', '02', '022922', '032623', '0358', '0366', ... 'za', 'zabriskie', 'zcomm', 'zenier', 'zikzak', 'zisfein', 'zmodem', 'zoo', 'zoology', 'zrepachol'], dtype='object', length=3972) |
| Selected Algorithm | TorchMLPClassifier |
| Time taken | 411.3803 |
| Selected Hyperparameters | {'hidden_layer_sizes': [100, 100, 100], 'dropout': 0.1, 'activation': 'relu', 'loss': 'default', 'optimizer': 'adagrad', 'n_epochs': 200, 'batch_size': 256, 'l2_reg': 0.0, 'class_weight': None, 'random_state': 7} |
| AutoML version | 23.1.1 |
| Python version | 3.8.7 (default, Aug 25 2022, 13:59:56) \n[GCC 8.5.0 20210514 (Red Hat 8.5.0-10.1.0.1)] |
| Algorithm | #Samples | #Features | Mean Validation Score | Hyperparameters | CPU Time | Memory Usage (GB) |
|---|---|---|---|---|---|---|
| TorchMLPClassifier_HT | 948 | 3972 | -0.0589 | {'activation': 'relu', 'class_weight': None, 'dropout': 0.1, 'l2_reg': 0.0, 'nr_layers': 3, 'nr_neurons': 100, 'optimizer': 'adagrad'} | 15.8551 | (0.005779266357421875, None) |
| TorchMLPClassifier_HT | 948 | 3972 | -0.0740 | {'activation': 'relu', 'class_weight': None, 'dropout': 0.0012425681319433632, 'l2_reg': 0.0, 'nr_layers': 3, 'nr_neurons': 100, 'optimizer': 'adam'} | 14.7492 | (0.0, None) |
| TorchMLPClassifier_HT | 948 | 3972 | -0.0740 | {'activation': 'relu', 'class_weight': None, 'dropout': 1e-05, 'l2_reg': 0.0, 'nr_layers': 3, 'nr_neurons': 100, 'optimizer': 'adam'} | 15.3938 | (0.0, None) |
| TorchMLPClassifier_HT | 948 | 3972 | -0.0740 | {'activation': 'relu', 'class_weight': 'balanced', 'dropout': 0.1, 'l2_reg': 0.0, 'nr_layers': 3, 'nr_neurons': 100, 'optimizer': 'adam'} | 16.0629 | (0.0, None) |
| TorchMLPClassifier_HT | 948 | 3972 | -0.0740 | {'activation': 'relu', 'class_weight': None, 'dropout': 0.1, 'l2_reg': 0.0, 'nr_layers': 3, 'nr_neurons': 100, 'optimizer': 'adam'} | 17.7880 | (0.01073455810546875, None) |
| ... | ... | ... | ... | ... | ... | ... |
| TorchMLPClassifier_AdaBoostClassifier_FS | 948 | 1 | -1.4784 | {'activation': 'relu', 'class_weight': None, 'dropout': 0.1, 'l2_reg': 0.0, 'nr_layers': 1, 'nr_neurons': 100, 'optimizer': 'adam'} | 0.5818 | 0.0 |
| GaussianNB_AS | 948 | 11026 | -2.6122 | {} | 0.3027 | 0.0 |
| TorchMLPClassifier_HT | 948 | 3972 | -8.2554 | {'activation': 'tanh', 'class_weight': 'balanced', 'dropout': 5e-06, 'l2_reg': 0.0, 'nr_layers': 3, 'nr_neurons': 133, 'optimizer': 'rmsprop'} | 20.5501 | (0.00018310546875, None) |
| DecisionTreeClassifier_AS | 948 | 11026 | -10.0133 | {'class_weight': None, 'max_features': 1.0, 'min_samples_leaf': 0.000625, 'min_samples_split': 0.00125} | 0.6093 | 0.0 |
| TorchMLPClassifier_HT | 948 | 3972 | -12.1224 | {'activation': 'relu', 'class_weight': None, 'dropout': 0.1, 'l2_reg': 0.0, 'nr_layers': 3, 'nr_neurons': 100, 'optimizer': 'rmsprop'} | 16.7356 | (0.0, None) |
We also provide the capability to visualize the results of each stage of the AutoML pipeline.
The plot below shows the scores predicted by Algorithm Selection for each algorithm. The horizontal line shows the average score across all algorithms. Algorithms below the line are colored turquoise, whereas those with a score higher than the mean are colored teal. Here we can see that the TorchMLPClassifier achieved the highest predicted score (orange bar), and is chosen for subsequent stages of the Pipeline.
# Each trial is a tuple of
# (algorithm, no. samples, no. features, mean CV score, hyperparameters,
# all CV scores, total CV time (s), memory usage (Gb))
trials = est1.model_selection_trials_
colors = []
scores = [x[3] for x in trials]
models = [x[0] for x in trials]
y_margin = 0.10 * (max(scores) - min(scores))
s = pd.Series(scores, index=models).sort_values(ascending=False)
for f in s.keys():
if f == '{}_AS'.format(est1.selected_model_):
colors.append('orange')
elif s[f] >= s.mean():
colors.append('teal')
else:
colors.append('turquoise')
fig, ax = plt.subplots(1)
ax.set_title("Algorithm Selection Trials")
ax.set_ylim(min(scores) - y_margin, max(scores) + y_margin)
ax.set_ylabel(est1.inferred_score_metric[0])
s.plot.bar(ax=ax, color=colors, edgecolor='black')
ax.axhline(y=s.mean(), color='black', linewidth=0.5)
plt.show()
Following Algorithm Selection, Adaptive Sampling aims to find the smallest dataset sample that can be created without compromising validation set score for the chosen model. Given the small size of the training data (948 samples), Adaptive Sampling is not relevant here.
# Each trial is a tuple of
# (algorithm, no. samples, no. features, mean CV score, hyperparameters,
# all CV scores, total CV time (s), memory usage (Gb))
trials = est1.adaptive_sampling_trials_
scores = [x[3] for x in trials]
n_samples = [x[1] for x in trials]
y_margin = 0.10 * (max(scores) - min(scores))
fig, ax = plt.subplots(1)
ax.set_title("Adaptive Sampling ({})".format(trials[0][0]))
ax.set_xlabel('Dataset sample size')
ax.set_ylabel(est1.inferred_score_metric[0])
ax.grid(color='g', linestyle='-', linewidth=0.1)
ax.set_ylim(min(scores) - y_margin, max(scores) + y_margin)
ax.plot(n_samples, scores, 'k:', marker="s", color='teal', markersize=3)
plt.show()
After finding a sample subset, the next step is to find a relevant feature subset that maximizes the score for the chosen algorithm. The Feature Selection step identifies the smallest feature subset that does not compromise the score of the chosen algorithm. The orange line shows the optimal number of features chosen by Feature Selection (in this case, retaining only about 4,000 of the more than 11,000 available words).
# Each trial is a tuple of
# (algorithm, no. samples, no. features, mean CV score, hyperparameters,
# all CV scores, total CV time (s), memory usage (Gb))
trials = est1.feature_selection_trials_
scores = [x[3] for x in trials]
n_features = [x[2] for x in trials]
y_margin = 0.10 * (max(scores) - min(scores))
fig, ax = plt.subplots(1)
ax.set_title("Feature Selection Trials")
ax.set_xlabel("Number of Features")
ax.set_ylabel(est1.inferred_score_metric[0])
ax.grid(color='g', linestyle='-', linewidth=0.1)
ax.set_ylim(min(scores) - y_margin, max(scores) + y_margin)
ax.plot(n_features, scores, 'k:', marker="s", color='teal', markersize=3)
ax.axvline(x=len(est1.selected_features_names_), color='orange', linewidth=2.0)
plt.show()
Hyperparameter Tuning is the last stage of the AutoML pipeline; it focuses on improving the chosen algorithm's score on the reduced dataset (after Adaptive Sampling and Feature Selection). We use a novel algorithm to search across many hyperparameter dimensions, converging automatically once optimal hyperparameters are identified. Each trial in the graph below represents a particular hyperparameter configuration for the selected model.
# Each trial is a tuple of
# (algorithm, no. samples, no. features, mean CV score, hyperparameters,
# all CV scores, total CV time (s), memory usage (Gb))
trials = est1.tuning_trials_
scores = [x[3] for x in reversed(trials)]
y_margin = 0.10 * (max(scores) - min(scores))
fig, ax = plt.subplots(1)
ax.set_title("Hyperparameter Tuning Trials")
ax.set_xlabel("Iteration $n$")
ax.set_ylabel(est1.inferred_score_metric[0])
ax.grid(color='g', linestyle='-', linewidth=0.1)
ax.set_ylim(min(scores) - y_margin, max(scores) + y_margin)
ax.plot(range(1, len(trials) + 1), scores, 'k:', marker="s", color='teal', markersize=3)
plt.show()
The Oracle AutoMLx tool also supports a user-specified time budget in seconds. Given the small size of this dataset, we set a small time budget of 10 seconds using the time_budget argument.
est2 = automl.Pipeline(task="classification")
est2.fit(X_train, y_train, X_valid, y_valid, cv=None, col_types=['text'], time_budget=10)
y_predict = est2.predict(X_test)
score_default = f1_score(y_test, y_predict, average="micro")
print('F1 micro Score on test data: {:3.3f}'.format(score_default))
[2023-02-10 12:36:44,349] [automl.pipeline] Random state (7) is used for model builds
[2023-02-10 12:36:44,389] [automl.pipeline] Dataset shape: (948, 1)
[2023-02-10 12:36:44,478] [automl.pipeline] Running Auto-Preprocessing
[2023-02-10 12:36:46,849] [automl.pipeline] Preprocessing completed. Updated Dataset shape: (948, 11026), cv: None
[2023-02-10 12:36:52,358] [automl.pipeline] SVC, KNeighborsClassifier are disabled for datasets with > 10K samples or > 1K features
[2023-02-10 12:36:52,359] [automl.pipeline] Running Model Selection
[2023-02-10 12:36:56,738] [automl.pipeline] Time budget exhausted in Algorithm Selection; defaulting to GaussianNB
[2023-02-10 12:36:56,826] [automl.pipeline] Model Selection completed. Selected model: ['GaussianNB']
[2023-02-10 12:36:56,827] [automl.pipeline] Running Adaptive Sampling. Dataset Shape: (948, 11026), Valid Shape: (238, 11026), CV: None, Class counts: [237 237 237 237]
[2023-02-10 12:36:56,906] [automl.pipeline] Timebudget exhausted. Skipping Adaptive Sampling
[2023-02-10 12:36:56,931] [automl.pipeline] Adaptive Sampling Completed. Updated Dataset Shape: (948, 11026), Valid Shape: (238, 11026), CV: None, Class counts: [237 237 237 237]
[2023-02-10 12:36:56,932] [automl.pipeline] Starting Feature Selection 0. Dataset Shape: (948, 11026)
[2023-02-10 12:36:57,005] [automl.pipeline] Timebudget exhausted. Skipping Feature Selection
[2023-02-10 12:36:57,006] [automl.pipeline] Using all features: Index(['00', '000', '0000', '00000', '00041032', '0004422', '001230', '001321',
'0029', '004418',
...
'zinc', 'zion', 'zip', 'zipping', 'zisfein', 'zmodem', 'zone', 'zoo',
'zoology', 'zrepachol'],
dtype='object', length=11026)
[2023-02-10 12:36:57,038] [automl.pipeline] Feature Selection 0 completed. Updated Dataset shape: (948, 11026)
[2023-02-10 12:36:57,110] [automl.pipeline] Timebudget exhausted. Skipping Hyperparameter Optimization for GaussianNB
[2023-02-10 12:36:57,164] [automl.pipeline] (Re)fitting Pipeline
[2023-02-10 12:37:03,594] [automl.xengine] Local ProcessPool execution (n_jobs=40)
[2023-02-10 12:37:03,723] [automl.pipeline] AutoML completed. Time taken - 18.579 sec
F1 micro Score on test data: 0.820
By default, the score metric is set to neg_log_loss for classification and neg_mean_squared_error for regression.
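As a quick standalone illustration (using scikit-learn directly, not the AutoMLx internals), the default classification metric neg_log_loss is simply the negated log loss of the predicted class probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy 3-class example: true labels and predicted class probabilities.
y_true = [0, 1, 2, 1]
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],
])

# neg_log_loss is the negated log loss; higher (closer to 0) is better.
neg_log_loss = -log_loss(y_true, y_proba)
print(round(neg_log_loss, 4))  # → -0.4632
```

A perfect model would assign probability 1.0 to every true class, giving a neg_log_loss of 0.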
The user can also choose a different metric from the list of supported scoring metrics.
Here, we ask AutoML to optimize for the 'f1_micro' scoring metric.
est3 = automl.Pipeline(task="classification", score_metric='f1_micro')
est3.fit(X_train, y_train, X_valid, y_valid, cv=None, col_types=['text'], time_budget=60)
y_predict = est3.predict(X_test)
score_default = f1_score(y_test, y_predict, average="micro")
print('F1 micro Score on test data: {:3.3f}'.format(score_default))
[2023-02-10 12:37:07,702] [automl.pipeline] Random state (7) is used for model builds [2023-02-10 12:37:07,733] [automl.pipeline] Dataset shape: (948, 1) [2023-02-10 12:37:07,819] [automl.pipeline] Running Auto-Preprocessing [2023-02-10 12:37:10,382] [automl.pipeline] Preprocessing completed. Updated Dataset shape: (948, 11026), cv: None [2023-02-10 12:37:16,069] [automl.pipeline] SVC, KNeighborsClassifier are disabled for datasets with > 10K samples or > 1K features [2023-02-10 12:37:16,070] [automl.pipeline] Running Model Selection
Time budget exceeded by 2.86s, resetting XEngine Timebudget Exceeded or Timedout completed 8/9, 0 tasks timedout
[2023-02-10 12:38:10,677] [automl.pipeline] Model Selection completed. Selected model: ['TorchMLPClassifier']
[2023-02-10 12:38:10,679] [automl.pipeline] Running Adaptive Sampling. Dataset Shape: (948, 11026), Valid Shape: (238, 11026), CV: None, Class counts: [237 237 237 237]
[2023-02-10 12:38:10,750] [automl.pipeline] Timebudget exhausted. Skipping Adaptive Sampling
[2023-02-10 12:38:10,775] [automl.pipeline] Adaptive Sampling Completed. Updated Dataset Shape: (948, 11026), Valid Shape: (238, 11026), CV: None, Class counts: [237 237 237 237]
[2023-02-10 12:38:10,776] [automl.pipeline] Starting Feature Selection 0. Dataset Shape: (948, 11026)
[2023-02-10 12:38:10,846] [automl.pipeline] Timebudget exhausted. Skipping Feature Selection
[2023-02-10 12:38:10,847] [automl.pipeline] Using all features: Index(['00', '000', '0000', '00000', '00041032', '0004422', '001230', '001321',
'0029', '004418',
...
'zinc', 'zion', 'zip', 'zipping', 'zisfein', 'zmodem', 'zone', 'zoo',
'zoology', 'zrepachol'],
dtype='object', length=11026)
[2023-02-10 12:38:10,879] [automl.pipeline] Feature Selection 0 completed. Updated Dataset shape: (948, 11026)
[2023-02-10 12:38:10,950] [automl.pipeline] Timebudget exhausted. Skipping Hyperparameter Optimization for TorchMLPClassifier
[2023-02-10 12:38:10,996] [automl.pipeline] (Re)fitting Pipeline
[2023-02-10 12:39:05,506] [automl.xengine] Local ProcessPool execution (n_jobs=40)
[2023-02-10 12:39:05,613] [automl.pipeline] AutoML completed. Time taken - 117.150 sec
F1 micro Score on test data: 0.908
For a variety of decision-making tasks, getting only a prediction as model output is not sufficient. A user may wish to know why the model outputs that prediction, or which data features are relevant for that prediction. For that purpose, the Oracle AutoMLx solution defines the MLExplainer object, which can compute a variety of model explanations.
The MLExplainer object takes as arguments the trained model, the training data and labels, as well as the task.
explainer = automl.MLExplainer(est1,
X_train,
y_train,
target_names=['sci.crypt', 'sci.electronics', 'sci.med', 'sci.space'],
task="classification",
col_types=["text"])
For text classification tasks, since we first extract tokens (features), the Oracle AutoMLx solution offers a single way to compute a notion of token importance: Global Token Importance. This notion intuitively measures how much a token impacts the model's predictions overall (relative to the provided train labels). It considers each token independently from all other tokens. Tokens are the basic building blocks of the NLP model, such as sentences, words, or characters.
We use a permutation-based method that successively measures the importance of each token (feature); it therefore runs in linear time with respect to the number of tokens in the dataset.
The explain_model() method computes these token importances. It also provides 95% confidence intervals for each token importance attribution.
result_explain_model_default = explainer.explain_model()
There are two options to show the explanation's results:
- to_dataframe() will return a dataframe of the results.
- show_in_notebook() will show the results as a bar plot.

The features are returned in decreasing order of importance.
result_explain_model_default.to_dataframe(n_tokens=20)
| | token | attribution | upper_bound | lower_bound |
|---|---|---|---|---|
| 0 | security | 0.000062 | 0.000090 | 0.000035 |
| 1 | space | 0.000024 | 0.000036 | 0.000012 |
| 2 | orbit | 0.000021 | 0.000021 | 0.000021 |
| 3 | prb | 0.000020 | 0.000024 | 0.000017 |
| 4 | gtoal | 0.000017 | 0.000033 | 0.000001 |
| 5 | encryption | 0.000016 | 0.000019 | 0.000013 |
| 6 | moon | 0.000016 | 0.000017 | 0.000015 |
| 7 | aurora | 0.000011 | 0.000013 | 0.000009 |
| 8 | tapped | 0.000011 | 0.000012 | 0.000010 |
| 9 | circuit | 0.000008 | 0.000009 | 0.000008 |
| 10 | disease | 0.000007 | 0.000008 | 0.000005 |
| 11 | alaska | 0.000006 | 0.000007 | 0.000006 |
| 12 | crypto | 0.000005 | 0.000005 | 0.000005 |
| 13 | launch | 0.000005 | 0.000005 | 0.000004 |
| 14 | shuttle | 0.000005 | 0.000006 | 0.000004 |
| 15 | power | 0.000004 | 0.000005 | 0.000004 |
| 16 | pitt | 0.000004 | 0.000004 | 0.000004 |
| 17 | encrypted | 0.000004 | 0.000004 | 0.000003 |
| 18 | clipper | 0.000004 | 0.000004 | 0.000003 |
| 19 | spacecraft | 0.000004 | 0.000004 | 0.000003 |
result_explain_model_default.show_in_notebook(n_tokens=20)
For text classification tasks, since we first extract tokens (features), the Oracle AutoMLx solution offers a single way to compute a notion of token importance for individual predictions: Local Token Importance. This notion intuitively measures how much each token impacts the model's prediction for a given instance. It considers each token independently from all other tokens.
By default, we use a surrogate method to measure the importance of each token in a given instance. This method runs in linear time with respect to the number of features in the dataset.
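The surrogate idea can be sketched as follows: perturb the instance by randomly dropping tokens, query the black-box model on each perturbation, and fit a simple linear surrogate whose coefficients serve as local token importances. This is a simplified, LIME-style sketch; black_box_proba below is a stand-in stub, not the actual AutoMLx pipeline:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Stand-in for a black-box predict_proba over token-presence vectors.
def black_box_proba(presence):
    # Pretend the model cares mostly about the first two tokens.
    weights = np.array([2.0, 1.5, 0.2, 0.1, 0.05])
    logits = presence @ weights
    return 1.0 / (1.0 + np.exp(-logits))

tokens = ["space", "launch", "the", "had", "some"]
rng = np.random.default_rng(0)

# Perturb the instance: randomly keep (1) or drop (0) each token.
Z = rng.integers(0, 2, size=(500, len(tokens)))
y = black_box_proba(Z)

# Fit a linear surrogate; its coefficients are local token importances.
surrogate = Ridge(alpha=1.0).fit(Z, y)
order = np.argsort(-surrogate.coef_)
print([tokens[i] for i in order[:2]])  # → ['space', 'launch']
```

Since each token contributes one column of the surrogate's design matrix, the cost grows linearly with the number of tokens in the instance.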
The explain_prediction() method computes these local feature importances. It also provides 95% confidence intervals for each feature importance attribution.
index = 0
result_explain_prediction_default = explainer.explain_prediction(X_train.iloc[index:index + 1, :])
There are two options to show the explanation's results:
- to_dataframe() will return a dataframe of the results.
- show_in_notebook() will show the results as a bar plot.

The features are returned in decreasing order of importance.
result_explain_prediction_default[0].to_dataframe()
| | Attribution | Token | Target |
|---|---|---|---|
| 0 | 1.347056e-01 | space | 14 |
| 1 | 1.291750e-01 | location | 14 |
| 2 | 1.062831e-01 | rattling | 14 |
| 3 | 7.680890e-02 | launch | 14 |
| 4 | 6.210265e-02 | wrench | 14 |
| ... | ... | ... | ... |
| 83 | 2.211368e-09 | com | 14 |
| 84 | -1.906403e-09 | Thiokol | 14 |
| 85 | -5.362070e-10 | had | 14 |
| 86 | 9.392158e-12 | I | 14 |
| 87 | 4.174225e-12 | some | 14 |
88 rows × 3 columns
result_explain_prediction_default[0].show_in_notebook()